Abstract


So far, boosting has been used to improve the quality of moderately accurate learning algorithms, 
by weighting and combining many of their weak hypotheses into a final classifier with theoretically 
high accuracy. In a recent work (Sebban, Nock and Lallich, 2001), we have attempted to adapt 
boosting properties to data reduction techniques. In this particular context, the objective was not 
only to improve the success rate, but also to reduce the time and space complexities due to the 
storage requirements of some costly learning algorithms, such as nearest-neighbor classifiers. In 
that framework, each weak hypothesis, which is usually built and weighted from the learning set, 
is replaced by a single learning instance. The weight given by boosting defines in that case the 
relevance of the instance, and a statistical test allows one to decide whether it can be discarded 
without damaging further classification tasks. In Sebban, Nock and Lallich (2001), we addressed 
problems with two classes. It is the aim of the present paper to relax the class constraint, and 
extend our contribution to multiclass problems. Beyond data reduction, experimental results are 
also provided on twenty-three datasets, showing the benefits that our boosting-derived weighting 
rule brings to weighted nearest neighbor classifiers. 


1. Introduction 


Some of the earliest approaches to classification are also among the simplest: they do not induce 
concept representations (decision trees, neural networks, etc.), but exploit simple structures of the 
learning set, such as neighborhoods, to classify instances. Among them, the most popular is prob- 
ably the 1-Nearest-Neighbor (NN) algorithm (Cover and Hart, 1967), and its generalization, the 
k-NN rule, which classifies an unknown instance according to a local vote by its k-nearest neigh- 
bors. Its use was widely spread and encouraged by early theoretical results linking its generalization 
error to Bayes risk. Under mild regularity assumptions on the underlying statistics, for any metric,
the large-sample risk incurred is less than twice the Bayes risk. Moreover, the risk incurred for
finite samples can be very reasonable under similar assumptions (Nock and Sebban, 2001b). How- 
ever, from a practical point of view, this algorithm has several problems, as mentioned in Breiman 
et al. (1984): (i) it is computationally expensive because it stores all the instances in memory; (ii) 
it is intolerant to noisy instances; (iii) it is intolerant to irrelevant attributes and (iv) it is sensitive to 
the chosen distance function. 

The deletion of noisy instances and irrelevant attributes is addressed by data reduction tech- 
niques. Recent complexity theoretic results show that some related optimization problems are very 
hard to approximate (Nock and Sebban, 2000). This advocates for the use of heuristics for data 
reduction. In this paper, we only focus on prototype selection, which consists of identifying and 
eliminating irrelevant instances. Prototype selection concerns both storage complexity (first prob- 
lem listed above) and noise tolerance (second problem). The last two problems are not discussed in 
this paper. Many solutions have been proposed to select relevant features (John, Kohavi and Pfleger, 
1994; Koller and Sahami, 1996; Sebban, 1999) and to define new distance functions (Wilson and 
Martinez, 1997). 

Many prototype selection methods have been suggested to improve the standard NN algorithm 
using different strategies: removing correctly classified examples (Hart, 1968; Gates, 1972), iden- 
tifying and eliminating mislabeled instances (Brodley and Friedl, 1996), deleting misclassified or 
irrelevant instances (Wilson and Martinez, 2000; Sebban and Nock, 2000), identifying relevant 
prototypes by Monte-Carlo sampling (Skalak, 1994), etc. Recently, we proposed an adaptation of 
boosting to prototype selection (Nock and Sebban, 2001a) in the PSBOOST algorithm. Boosting, 
as used in the well known ADABOOST algorithm (Freund and Schapire, 1997), generates a final 
combined classifier whose error on the learning set is small by weighting and combining T weak 
hypotheses, each of which may have a large error. Here, T is the number of boosting rounds, a 
parameter fixed in advance. Freund and Schapire (1996) proposed reducing the number of instances 
used by each weak hypothesis to speed up the NN classifier. As far as we know, this work was 
the first attempt to use boosting in prototype selection, although their goal was not to improve the 
accuracy. The objective of PSBOOST is to obtain a good balance between storage requirements and 
generalization accuracy. Its principle is to use each instance as a weak hypothesis: the confidence 
weight given by boosting becomes in our case an indication of the instance’s relevance. Experimen- 
tal results indicate the efficiency of this approach (Nock and Sebban, 2001a). Inspired by boosting, 
PSBOOST suffers from the same important drawback: the control of the number of boosting rounds, 
that is, the size $N_p$ of the final prototype set in our framework. Nock and Sebban (2001a) studied
the balance between a small value of $N_p$, which allows high storage reduction but decreases the
accuracy, and a large value, which allows us to control the generalization accuracy but still needs high
storage requirements. The results obtained reveal the crucial need for a method that fixes this
parameter as accurately as possible (Nock and Sebban, 2001a). A first attempt to cope with this problem is
provided by Sebban, Nock and Lallich (2001), but it holds only for problems with two classes. 

In this paper, we relax the class constraint, thereby extending our framework to multiclass problems.
We draw up a statistical test based on the normalization factor $Z$, the criterion minimized
in ADABOOST, and optimized in PSBOOST as well. Experimental results display the ability of
this criterion to obtain a significant size reduction together, on average, with an increase in
accuracy. This generalized version of PSBOOST, called PSBOOST2_MC, also displays experimentally
its ability to address the first two problems (storage requirement and noise tolerance) of k-NN
classifiers.

A significant drawback of k-NN classifiers is that they require fixing k in advance. This is clearly
not an easy task in real-world domains. While a small value of k is often sufficient for noise free 
problems, the k-NN rule requires thorough investigations for complex problems, often leading to 
the testing of many values of k. To cope with this problem, in this paper, we extend our algorithm to 
another kind of neighborhood-based classifier, whose geometry does not rely on ad hoc parameters. 
The underlying neighborhood graph is called the Relative Neighborhood Graph (RNG) (Toussaint, 
1980). Experimental results again display the ability of our algorithm to improve classifiers based 
on the RNG, even in the presence of noise. 


The final contribution of this paper is not restricted to data reduction. In Sebban, Nock and 
Lallich (2001), it is argued that the instance’s weighting rule derived from boosting deserves inves- 
tigations for its use in weighted nearest neighbors classifiers. We provide in this paper experimental 
results on a body of twenty-three datasets. They display significant improvements obtained when 
using boosting-derived weights. 


In the rest of this paper, after having briefly recalled the main properties of boosting and PS- 
BOOST in Section 2, we describe in Section 3 our statistical criterion for automatically halting the 
selection procedure, and the new version of our algorithm, called PSBOOST2. In Section 4, we 
describe the RNG, before presenting a large experimental study (Section 5). We make some obser- 
vations in Section 6, and we explain why PSBOOST2 is suited for reducing storage while controlling 
the classifier accuracy. In Section 7, we present the extension of the test to multiclass problems. The 
use of the instance weights in weighted classifiers is discussed in Section 8, before our final conclu- 
sion. 


2. Adapting Boosting to Data Reduction 


In this section, we recall the main properties of boosting and PSBOOST. 


2.1 Properties of Boosting 


Boosting consists in combining many ($T$) weak hypotheses produced from various distributions
$D_t(e)$ over the learning set (LS). The pseudocode of the original boosting algorithm, called
ADABOOST (Freund and Schapire, 1997), is described in Figure 1. At each stage $t$, ADABOOST
decreases (resp. increases) the weight of learning instances, a priori labeled $y(e)$, which are correctly
(resp. incorrectly) classified by the current weak hypothesis $h_t$. Boosting thus forces the weak
learner to learn the hardest examples. The weighted combination $H(e)$ of all the weak hypotheses
results in a better performing model. Schapire and Singer (1998) proved that, in order to minimize
learning error, one must seek to minimize $Z_t$ in each round of boosting, requiring the use of a
specific confidence $\alpha_t$.


In order to present our adaptation of boosting to storage reduction with neighborhood-based
classifiers, we first introduce several notations proposed by Schapire and Singer (1998). Suppose
that $y(e) \in \{-1, +1\}$ and that the output of each weak hypothesis $h_t$ is restricted to $\{-1, 0, +1\}$. Let
$W^{-1}$, $W^{0}$ and $W^{+1}$ be defined by

$$W^{b} = \sum_{e \in LS:\, y(e) h_t(e) = b} D_t(e) .$$
ADABOOST(LS, W, T)
    Initialize distribution D_1(e) = 1/|LS| for any e ∈ LS;
    For t = 1, 2, ..., T
        Train weak learner W on LS using D_t
        and get a weak hypothesis h_t;
        Compute the confidence α_t = (1/2) log((1 − ε_t)/ε_t),
        where ε_t = Σ_{e: y(e) ≠ h_t(e)} D_t(e) is the error of h_t;
        Update:
        ∀e ∈ LS: D_{t+1}(e) = D_t(e) exp(−α_t y(e) h_t(e)) / Z_t;
        /* Z_t is a normalization factor */
    endFor
    Return the classifier H(e) = sign(Σ_{t=1}^{T} α_t h_t(e))

Figure 1: Pseudocode for ADABOOST.
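As a rough illustration of Figure 1, the following Python sketch runs ADABOOST with one-dimensional threshold stumps as the weak learner. The stump learner, the function names (`train_stump`, `adaboost`, `predict`) and the toy data are assumptions made for this example only; they do not come from the paper.

```python
import math

def train_stump(xs, ys, dist):
    """Best 1-D threshold stump under dist: returns (error, threshold, sign)."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (+1, -1):
            preds = [sign if x >= thr else -sign for x in xs]
            err = sum(d for d, p, y in zip(dist, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    dist = [1.0 / n] * n                        # D_1(e) = 1/|LS|
    ensemble = []                               # list of (alpha_t, threshold, sign)
    for _ in range(rounds):
        err, thr, sign = train_stump(xs, ys, dist)
        err = min(max(err, 1e-12), 1 - 1e-12)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        preds = [sign if x >= thr else -sign for x in xs]
        dist = [d * math.exp(-alpha * y * p) for d, y, p in zip(dist, ys, preds)]
        z = sum(dist)                           # normalization factor Z_t
        dist = [d / z for d in dist]
        ensemble.append((alpha, thr, sign))
    return ensemble

def predict(ensemble, x):
    """H(x) = sign of the alpha-weighted vote of the stumps."""
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

# Toy learning set that no single stump classifies perfectly.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
training_errors = sum(predict(ensemble, x) != y for x, y in zip(xs, ys))
```

On this toy set each stump alone makes at least three mistakes, while the weighted combination of three stumps separates the two classes.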


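Using symbols + and − for +1 and −1, we can calculate the normalization factor $Z$ as:

$$Z = \sum_{e \in LS} D_t(e) \exp(-\alpha_t y(e) h_t(e)) = \sum_{b} \sum_{e \in LS:\, y(e) h_t(e) = b} D_t(e) \exp(-\alpha_t b) = W^{0} + W^{-}\exp(\alpha_t) + W^{+}\exp(-\alpha_t) .$$

$Z$ is then minimized when

$$\alpha_t = \frac{1}{2}\log\left(\frac{W^{+}}{W^{-}}\right) . \qquad (1)$$

Freund and Schapire's original ADABOOST algorithm would instead have made the more conservative choice

$$\alpha_t = \frac{1}{2}\log\left(\frac{W^{+} + \frac{1}{2}W^{0}}{W^{-} + \frac{1}{2}W^{0}}\right) ,$$

giving a normalization coefficient $Z$ which Freund and Schapire (1997) upper bound by

$$Z \leq 2\sqrt{\left(W^{+} + \frac{1}{2}W^{0}\right)\left(W^{-} + \frac{1}{2}W^{0}\right)} .$$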
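The relation between the two choices of confidence can be checked numerically. The sketch below evaluates $Z = W^{0} + W^{-}e^{\alpha} + W^{+}e^{-\alpha}$ at the minimizing confidence and at the more conservative one, and compares both with the upper bound. The weights 0.5, 0.2 and 0.3 and the function names are arbitrary values chosen for illustration.

```python
import math

def z_value(w_plus, w_minus, w_zero, alpha):
    """Normalization factor Z for a given confidence alpha."""
    return w_zero + w_minus * math.exp(alpha) + w_plus * math.exp(-alpha)

def alpha_optimal(w_plus, w_minus):
    """Confidence that minimizes Z (ignores W^0)."""
    return 0.5 * math.log(w_plus / w_minus)

def alpha_conservative(w_plus, w_minus, w_zero):
    """More conservative confidence: W^0 is shared equally between both sides."""
    return 0.5 * math.log((w_plus + 0.5 * w_zero) / (w_minus + 0.5 * w_zero))

def z_bound(w_plus, w_minus, w_zero):
    """Upper bound on Z under the conservative confidence."""
    return 2.0 * math.sqrt((w_plus + 0.5 * w_zero) * (w_minus + 0.5 * w_zero))

# Illustrative weights; they must sum to 1.
w_plus, w_minus, w_zero = 0.5, 0.2, 0.3
z_opt = z_value(w_plus, w_minus, w_zero, alpha_optimal(w_plus, w_minus))
z_con = z_value(w_plus, w_minus, w_zero, alpha_conservative(w_plus, w_minus, w_zero))
bound = z_bound(w_plus, w_minus, w_zero)
```

With these weights the three quantities are ordered as expected: the minimal $Z$ is the smallest, the conservative choice gives a slightly larger $Z$, and the bound sits above both.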


2.2 PSBOOST 


Suppose now that each weak hypothesis $h_t$ is not a classifier produced from the whole learning set
(LS), but rather a given example $e$. In ADABOOST, the confidence $\alpha_t$ is a function of the prediction
error of $h_t$ on LS. Replacing $h_t$ by $e$ requires a more sophisticated error measure that we can call
the pseudo-loss, as used in Freund and Schapire (1996). While the loss of a classifier $h_t$ is based
on its ability to correctly classify all the instances, the pseudo-loss of $e$ must take into account its
influence only on its neighborhood in LS.


Definition 1 Let $N(e)$ be the neighborhood of an instance $e$ of the learning set LS:
$N(e) = \{e' \in LS : e' \text{ is one of the k-nearest neighbors of } e \text{ in the oriented k-NN graph}\}$.


Note that the above definition can be extended to other neighborhood graphs.


Definition 2 Let $R(e)$ be the reciprocal neighborhood of an instance $e$ of the learning set LS:
$R(e) = \{e' \in LS : e \in N(e')\}$.


Stated differently, $R(e)$ represents the set of instances which have $e$ in their neighborhood.
Whenever the neighborhood relationship can be represented by a directed graph, such as for the
k-NN rule, we generally have $R(e) \neq N(e)$. If we consider $e$ as a weak hypothesis, its output takes
three possible values in the case with two classes:


• $y(e) \in \{-1, +1\}$ for any instance in $R(e)$,
• 0 for any instance not in $R(e)$.
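The difference between $N(e)$ and $R(e)$ can be made concrete with a small sketch. It builds the oriented k-NN graph on four one-dimensional points and inverts it. The Euclidean distance, the tie-breaking by index and the function names are assumptions of this example.

```python
def knn_neighborhoods(points, k):
    """N[i]: indices of the k nearest neighbors of point i (ties broken by index)."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    n = len(points)
    neigh = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (dist2(points[i], points[j]), j))
        neigh.append(set(order[:k]))
    return neigh

def reciprocal_neighborhoods(neigh):
    """R[i] = {j : i belongs to N[j]}."""
    recip = [set() for _ in neigh]
    for j, n_j in enumerate(neigh):
        for i in n_j:
            recip[i].add(j)
    return recip

points = [(0.0,), (1.0,), (2.0,), (10.0,)]
N = knn_neighborhoods(points, k=1)
R = reciprocal_neighborhoods(N)
```

Here the isolated point at 10.0 has the point at 2.0 as its nearest neighbor, yet no point has it as a neighbor: its reciprocal neighborhood is empty, so as a weak hypothesis it would abstain on every instance.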


Let $W_e^{+}$ (resp. $W_e^{-}$) be the fraction of instances in $R(e)$ having the same class as $e$ (resp. a
different class from $e$), and let $W_e^{0}$ be the fraction of instances to which $e$ gives a null vote (those not
in $R(e)$). Then, the example $e$ we choose at each round $t$ of boosting should be the one minimizing
the following coefficient:

$$Z_e = 2\sqrt{\left(W_e^{+} + \frac{1}{2}W_e^{0}\right)\left(W_e^{-} + \frac{1}{2}W_e^{0}\right)} , \qquad (2)$$

and the confidence $\alpha_e$ can be calculated as

$$\alpha_e = \frac{1}{2}\log\left(\frac{W_e^{+} + \frac{1}{2}W_e^{0}}{W_e^{-} + \frac{1}{2}W_e^{0}}\right) . \qquad (3)$$

Note that we use here the less optimal quantities given by Freund and Schapire (1997) and not
those proposed by Schapire and Singer (1998). Our choice basically increases the influence of $W_e^{0}$,
since parameter $W^{0}$ is absent from the weighting coefficient in Equation 1. This choice is motivated
by the fact that in our case, many instances do not belong to the reciprocal neighborhood $R(e)$ of
some instance $e$, resulting in a value for $W_e^{0}$ that is possibly much higher than for the abstaining weak
hypotheses of Schapire and Singer's (1998) model. In our approach, an instance $e$ with a small $W_e^{0}$
(of course combined with a high $W_e^{+}$) has a high local influence, and is then considered an interesting
candidate for the selection. Note that once a prototype is selected, it is still counted in other reciprocal
neighborhoods, but of course not as a candidate.
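For a single candidate, Equations 2 and 3 reduce to a few sums over its reciprocal neighborhood. The sketch below assumes the current distribution is given as a list `dist` summing to 1 and that `recip[e]` holds the indices in $R(e)$; these names and the eight-instance example are hypothetical.

```python
import math

def candidate_stats(e, dist, labels, recip):
    """Return (W+, W-, W0, Z_e, alpha_e) for candidate e.

    Assumes dist sums to 1 and that W- + W0/2 > 0 (otherwise alpha_e is infinite).
    """
    w_plus = sum(dist[j] for j in recip[e] if labels[j] == labels[e])
    w_minus = sum(dist[j] for j in recip[e] if labels[j] != labels[e])
    w_zero = 1.0 - w_plus - w_minus            # weight of instances outside R(e)
    f_plus = w_plus + 0.5 * w_zero
    f_minus = w_minus + 0.5 * w_zero
    z_e = 2.0 * math.sqrt(f_plus * f_minus)    # Equation 2
    alpha_e = 0.5 * math.log(f_plus / f_minus) # Equation 3
    return w_plus, w_minus, w_zero, z_e, alpha_e

# Eight instances with uniform weights; instance 0 has three reciprocal
# neighbors, two of its own class and one of the other class.
dist = [1.0 / 8] * 8
labels = [1, 1, 1, -1, -1, -1, 1, -1]
recip = [{1, 2, 3}] + [set() for _ in range(7)]
w_plus, w_minus, w_zero, z_e, alpha_e = candidate_stats(0, dist, labels, recip)
```

With these values $W_e^{+} = 0.25$, $W_e^{-} = 0.125$ and $W_e^{0} = 0.625$: the large null weight keeps $Z_e$ close to 1 even though the candidate agrees with most of its reciprocal neighbors.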

The pseudocode of our algorithm PSBOOST is described in Figure 2. Note that, in this section,
the confidence $\alpha_e$ is only used for selecting the prototypes and not for generating a weighted classifier,
which is the subject of the last section of this paper. This choice is motivated by the fact that our
original goal is to select the most relevant instances from LS. Once the selection is done, the output
LS' can then be used as a standard learning set. In order to assess the efficiency of our selection
method, we compare the performances of LS and LS' without any other optimization strategy (for
instance by generating a weighted classifier).


PSBOOST(LS, N_p)
    Initialize distribution D_1(e) = 1/|LS| for any e ∈ LS;
    Initialize candidates set LS_c = LS;
    Initialize LS' = ∅;
    For t = 1, 2, ..., N_p
        e = argmin_{e' ∈ LS_c} Z_{e'};
        If α_e < 0 Then EndLoop;
        LS' = LS' ∪ {e};
        LS_c = LS_c − {e};
        Update:
        ∀e' ∈ R(e): D_{t+1}(e') = D_t(e') exp(−α_e y(e) y(e')) / Z_e;
        ∀e' ∈ LS \ R(e): D_{t+1}(e') = D_t(e') / Z_e;
        /* Z_e is a normalization coefficient */
    endFor
    Return LS'

Figure 2: Pseudocode for PSBOOST. The output of this algorithm is the prototype subset LS'.


Datasets: AUDIOLOGY, AUSTRAL, BIGPOLE, BREAST, BRIGHTON, BUPA, ECHOCARDIO, GERMAN, GLASS2, HARD,
HEART, HEPATITIS, HORSE, IONOSPHERE, LED+17, LED, PIMA, VEHICLE, WHITEHOUSE, XD6

            kNN      CF               PSRCG            PSBOOST   MC       RT3              PSBOOST*
            Acc.     Acc.    % prot   Acc.    % prot   Acc.      Acc.     Acc.    % prot   Acc.
AVERAGE     74.75    75.08   81.3     74.19   62.4     75.46     73.70    68.31   8.7      70.80

Table 1: Results (accuracy Acc. and percentage of selected instances % prot) for kNN (k = 5), CF, PSRCG,
PSBOOST, MC (Monte-Carlo), RT3, PSBOOST*; PSBOOST (resp. PSBOOST*) means that PSBOOST is run
with exactly the same number of prototypes as PSRCG (resp. RT3).
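The selection loop of Figure 2 can be sketched in Python as follows. This is a minimal illustration under several assumptions: the neighborhood is a k-NN graph with Euclidean distance, ties are broken by index, and the distribution is renormalized by its actual sum rather than by $Z_e$ so that it stays a proper distribution. The function names and the toy data are hypothetical.

```python
import math

def reciprocal_knn(points, k):
    """recip[i] = set of instances that have i among their k nearest neighbors."""
    n = len(points)
    recip = [set() for _ in range(n)]
    for j in range(n):
        order = sorted((i for i in range(n) if i != j),
                       key=lambda i: (sum((a - b) ** 2
                                          for a, b in zip(points[j], points[i])), i))
        for i in order[:k]:
            recip[i].add(j)
    return recip

def stats(e, dist, labels, recip):
    """Return (Z_e, alpha_e) following Equations 2 and 3."""
    w_plus = sum(dist[j] for j in recip[e] if labels[j] == labels[e])
    w_minus = sum(dist[j] for j in recip[e] if labels[j] != labels[e])
    w_zero = 1.0 - w_plus - w_minus
    f_plus = max(w_plus + 0.5 * w_zero, 1e-12)    # guards against log(0)
    f_minus = max(w_minus + 0.5 * w_zero, 1e-12)
    return 2.0 * math.sqrt(f_plus * f_minus), 0.5 * math.log(f_plus / f_minus)

def psboost(points, labels, k, n_prototypes):
    n = len(points)
    recip = reciprocal_knn(points, k)
    dist = [1.0 / n] * n
    candidates = set(range(n))
    selected = []
    for _ in range(n_prototypes):
        if not candidates:
            break
        scored = {e: stats(e, dist, labels, recip) for e in candidates}
        e = min(sorted(candidates), key=lambda c: scored[c][0])
        _, alpha_e = scored[e]
        if alpha_e < 0:
            break
        selected.append(e)
        candidates.discard(e)
        for j in recip[e]:                        # instances outside R(e) keep their weight
            dist[j] *= math.exp(-alpha_e * labels[e] * labels[j])
        total = sum(dist)                         # assumption: exact renormalization
        dist = [d / total for d in dist]
    return selected

# Two well-separated clusters, one per class.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,), (13.0,)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
prototypes = psboost(points, labels, k=2, n_prototypes=2)
```

After the first prototype is chosen, the weights of its reciprocal neighbors decrease, so the second round favors an instance from the other cluster.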

Some useful observations can be made about the value of $Z_e$ and its contribution to removing
irrelevant instances in LS. First, if an instance $e$ belongs to a region with very few instances, it will
not belong to many reciprocal neighborhoods, resulting in a large $W_e^{0}$, which prevents
$Z_e$ from being small. Secondly, if a prototype belongs to a region with evenly distributed instances, $W_e^{+}$ and
$W_e^{-}$ tend to be balanced, and this again prevents $Z_e$ from being small. Note that with our strategy,
a cluster of instances of the same class could all be picked for LS', resulting in redundancy in
the final subset. One way to address this drawback would be to apply a post-process that removes
redundancy. For example, Sebban and Nock (2000) proposed, in another context, to compute an
information measure from a (k+1)-NN graph. Only instances at the center of clusters keep a
null uncertainty with k+1 neighbors. Removing such instances allows the deletion of the useless
instances from the clusters.

Note in Figure 2 that the user must provide a value for $N_p$, the number of prototypes. In this
paper, we provide a theoretical framework for automatically determining $N_p$ using a statistical test.
Nock and Sebban (2001a) carried out a large comparative study between PSBOOST and the
state-of-the-art prototype selection algorithms, for which we recall the main results (obtained by
cross-validation) in Table 1. CF corresponds to the Consensus Filter (Brodley and Friedl, 1996), PSRCG
was proposed by Sebban and Nock (2000), RT3 by Wilson and Martinez (2000), and MC
corresponds to Monte-Carlo sampling as proposed by Skalak (1994) (for more details see Nock and
Sebban (2001a)). Although these results are interesting, the parameter $N_p$ must be fixed in advance,
and that constitutes a drawback for PSBOOST in its original version.


3. Theoretical Stopping Criterion 


In this section, we describe our statistical criterion for automatically halting the selection procedure. 


3.1 A Random Framework for Test Construction 


In this section, we propose a theoretical framework for determining the number of weak hypotheses
$N_p$. Our strategy is based on a statistical test. Let $H_0$ be the null hypothesis of this test, which
expresses the idea that a given $e$ does not statistically contribute to give information about the
labelling of its reciprocal neighborhood. Informally, as long as $H_0$ can be kept, such an instance
can be removed without reasonably endangering further classification tasks. This requires a statistic
that assesses for a given candidate $e$ the validity of $H_0$, and for which we provide the statistical law
under $H_0$. For a given risk $\theta$, we stop the selection if and only if all the candidates have a p-value
higher than $\theta$. Stated differently, the algorithm stops if the best current candidate does not allow
the rejection of $H_0$ with a risk smaller than $\theta$. We provide here a theoretical framework for binary
problems. The extension to multiclass problems is discussed in Section 7.

A possible way of proceeding consists in considering under $H_0$ that, in the reciprocal neighborhood
$R(e)$, the true class $Y$ is randomly distributed with a given probability $\pi_0$ (if $y(e) = 1$) or $1 - \pi_0$
(if $y(e) = -1$). Two ways are possible to fix $\pi_0$:


1. Choose $\pi_0$ equal to the global proportion in LS of positive learning instances (those for which
$y(e) = 1$).
2. Use $\pi_0 = 0.5$ to satisfy a majority vote rule for a 2-class problem, often used in classification
tasks. Stated differently, we test if $e$ classifies instances in $R(e)$ better than a simple coin toss.


Let $H_0(\pi_0)$ be the corresponding null hypothesis. Under $H_0(\pi_0)$, an instance of the reciprocal
neighborhood $R(e)$ belongs to the same class as $e$ with probability $\pi_0$ (resp. $1 - \pi_0$) if $y(e) = 1$ (resp.
$y(e) = -1$).


3.2 Law of $W_e^{+}$ under $H_0$


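In our approach, an instance $e$ is selected by minimizing the quantity $Z_e$, while ensuring a positive
confidence $\alpha_e$ (which avoids the selection of mislabeled instances):

$$Z_e = 2\sqrt{\left(W_e^{+} + \frac{1}{2}W_e^{0}\right)\left(W_e^{-} + \frac{1}{2}W_e^{0}\right)} = 2\sqrt{\left(W_e^{+} + \frac{1}{2}W_e^{0}\right)\left(1 - W_e^{+} - \frac{1}{2}W_e^{0}\right)} ,$$

because $W_e^{+} + W_e^{-} + W_e^{0} = 1$. Then, $Z_e$ depends on the value of $W_e^{+}$ in $R(e)$:

$$W_e^{+} = \sum_{e' \in R(e):\, y(e') = y(e)} D_t(e') = \sum_{e' \in R(e)} D_t(e')\, 1_{\{y(e') = y(e)\}} ,$$

where the boolean variable $1_{\{y(e') = y(e)\}}$ is 1 iff $y(e') = y(e)$, and 0 otherwise. If $H_0(\pi_0)$ is true,
$1_{\{y(e') = y(e)\}}$ follows a binomial law $B(1, p)$, where $p = \pi_0$ if $y(e) = 1$ else $p = 1 - \pi_0$. Considering
that $W_e^{+}$ depends on examples $i$, $i = 1, 2, .., |R(e)|$ (the size of the reciprocal neighborhood), we
propose the following simplification:

$$W_e^{+} = \sum_{i=1}^{|R(e)|} D_t(i)\, I_i .$$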


There are two different ways to construct the distribution of $W_e^{+}$ under $H_0$ to compute the critical
value of $W_e^{+}$, called $W_{1-\theta}^{+}$. We recall here that the critical value defines the bound of the rejection
region of $H_0$, and corresponds to the $(1-\theta)$-percentile of the distribution of $W_e^{+}$ under $H_0$. In the
two following approaches, we assume that the $D_t(i)$ are not random variables, even if in theory,
they depend on the labels of the examples. First, the distribution can be assessed by a normal
approximation. In this case, under $H_0(\pi_0)$, $W_e^{+}$ is a weighted sum of $|R(e)|$ variables $I_i$, where the
$I_i$ are independently and identically distributed. The mean and variance of $W_e^{+}$ are:

$$E(W_e^{+} \mid H_0) = \sum_{i=1}^{|R(e)|} D_t(i)\, E(I_i) = p \sum_{i=1}^{|R(e)|} D_t(i) ,$$

$$Var(W_e^{+} \mid H_0) = \sum_{i=1}^{|R(e)|} D_t^2(i)\, Var(I_i) = p(1-p) \sum_{i=1}^{|R(e)|} D_t^2(i) .$$

The other way to proceed would consist of simulating the distribution of $W_e^{+}$, which can deal with
cases where the approximation constraints are not satisfied. For balanced weights (when $W_e^{+}$ and
$W_e^{-}$ are close), $|R(e)| > 10$ is enough to satisfy these constraints. In an unbalanced case, $W_e^{+}$ must
be larger.


3.3 Statistical Test 


Without a criterion for halting the selection, PSBOOST requires the provision of the number $N_p$
of weak hypotheses. Such a strategy may lead to the selection of an instance for which the null
hypothesis $H_0$ would not be rejected. By introducing a statistical test using the critical value $W_{1-\theta}^{+}$,
we keep only instances $e$ for which $W_e^{+}$ is exceptionally high under $H_0$ (i.e., $W_e^{+} > W_{1-\theta}^{+}$). Among
these, we choose at a given stage of the selection the one that minimizes $Z_e$ or equivalently $Z_e^2$. The
procedure is stopped if for all the instances $e$, $W_e^{+} \leq W_{1-\theta}^{+}$.


3.3.1 ASSESSING THE CRITICAL VALUE OF $W_e^{+}$


We assess $W_{1-\theta}^{+}$ either by normal approximation or by simulation, which is computationally expensive,
but sometimes necessary if the approximation conditions are not satisfied. By approximation,
$W_{1-\theta}^{+}$ is easily defined as follows:

$$W_{1-\theta}^{+} = p \sum_{i=1}^{|R(e)|} D_t(i) + u_{1-\theta} \sqrt{p(1-p) \sum_{i=1}^{|R(e)|} D_t^2(i)} ,$$

where $u_{1-\theta}$ is the $(1-\theta)$-percentile of the normal law $N(0,1)$. If the approximation constraints are
not satisfied, we can artificially construct a distribution of $W_e^{+}$, by simulating $|R(e)|$ independent
observations $I_i$ according to $B(1, p)$, and computing the weighted sum $\sum_{i=1}^{|R(e)|} D_t(i)\, I_i$. By repeating
this procedure $N$ times, an estimate of $W_{1-\theta}^{+}$ is the $(1-\theta)$-percentile of the $N$ samples.
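Both ways of assessing the critical value can be sketched with the Python standard library. The normal version uses the mean $p \sum D_t(i)$ and variance $p(1-p) \sum D_t^2(i)$ of the weighted sum; the simulated version draws the Bernoulli indicators directly. The function names, the number of draws and the fixed seed are choices made for this illustration.

```python
import math
import random
from statistics import NormalDist

def critical_value_normal(weights, p, theta):
    """(1 - theta)-percentile of W+ under H0 by normal approximation.

    weights: the D_t(i) of the instances in R(e).
    """
    u = NormalDist().inv_cdf(1.0 - theta)
    mean = p * sum(weights)
    var = p * (1.0 - p) * sum(w * w for w in weights)
    return mean + u * math.sqrt(var)

def critical_value_simulated(weights, p, theta, n_draws=20000, seed=0):
    """Empirical (1 - theta)-percentile of the simulated weighted sums."""
    rng = random.Random(seed)
    draws = sorted(sum(w for w in weights if rng.random() < p)
                   for _ in range(n_draws))
    index = min(n_draws - 1, int(math.ceil((1.0 - theta) * n_draws)) - 1)
    return draws[index]

# Thirty reciprocal neighbors with equal weights, tested against a coin toss.
weights = [0.01] * 30
w_crit_normal = critical_value_normal(weights, p=0.5, theta=0.05)
w_crit_sim = critical_value_simulated(weights, p=0.5, theta=0.05)
```

With thirty equally weighted neighbors the two estimates agree closely; the simulated value can only take multiples of 0.01 here, which explains the small residual gap.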
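
3.3.2 DECISION RULE


An instance $e$ is selected by minimizing $Z_e$:

$$Z_e = 2\sqrt{\left(W_e^{+} + \frac{1}{2}W_e^{0}\right)\left(W_e^{-} + \frac{1}{2}W_e^{0}\right)} ,$$

while ensuring a positive confidence $\alpha_e$:

$$\alpha_e = \frac{1}{2}\log\left(\frac{W_e^{+} + \frac{1}{2}W_e^{0}}{W_e^{-} + \frac{1}{2}W_e^{0}}\right) .$$

At each stage of the selection, our procedure minimizes the quantity $Z_e^2 = 4F(1-F)$, where $F =
W_e^{+} + \frac{1}{2}W_e^{0}$. The critical value of $F$ with the risk $\theta$ is directly deduced from $W_{1-\theta}^{+}$:

$$F_{1-\theta} = W_{1-\theta}^{+} + \frac{1}{2}W_e^{0} = p \sum_{i=1}^{|R(e)|} D_t(i) + \frac{1}{2}W_e^{0} + u_{1-\theta} \sqrt{p(1-p) \sum_{i=1}^{|R(e)|} D_t^2(i)} .$$
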
Under $H_0(\pi_0)$, $F_{1-\theta}$ can in theory be smaller than 0.5 when $p < 0.5$. In this case, if two candidates
satisfy the first condition ($W_e^{+} > W_{1-\theta}^{+}$), their confidences are then negative, and paradoxically we
will choose the candidate $e$ which presents the smaller value $F_e$ ($F_e \in [F_{1-\theta}, 0.5]$), by minimizing $Z_e$.
This situation, possible when $p$ is very close to 0, in fact rarely occurs because there is almost always
a candidate $e'$ for which $F_{1-\theta} > 0.5$ and $F_{e'} > F_{1-\theta}$ ($F_{e'} \in [F_{1-\theta}, 1]$), often resulting in $Z_{e'} < Z_e$. This
fact has been confirmed by an experimental study. Actually, on 18 datasets, using a 5-fold
cross-validation resulting in 90 different databases, we noted that this situation never occurred. However,
the neighborhood-based classifiers, such as the k-nearest-neighbors, usually use a majority decision
rule with a threshold 0.5 (in the case of 2 classes). In such a context, it is more suitable to test
the null hypothesis $H_0(0.5)$, which means that we select only the instance $e$ that classifies, in the
reciprocal neighborhood $R(e)$, significantly better than a simple coin toss. In this case, $F > 1 - F$, and
then $\alpha_e > 0$. Under $H_0(0.5)$, we always have $p = 0.5$, and since $\sum_{i=1}^{|R(e)|} D_t(i) + W_e^{0} = 1$, the previous
formulae for $F_{1-\theta}$ can be simplified:

$$F_{1-\theta} = \frac{1}{2}\left(1 + u_{1-\theta} \sqrt{\sum_{i=1}^{|R(e)|} D_t^2(i)}\right) .$$

We deduce the critical values of $Z_e^2$ and $\alpha_e$ with the risk $\theta$, called $c_\theta$ and $\alpha_{1-\theta}$:

$$c_\theta = \left(2\sqrt{F_{1-\theta}(1 - F_{1-\theta})}\right)^2 = 1 - u_{1-\theta}^2 \sum_{i=1}^{|R(e)|} D_t^2(i) ,$$

$$\alpha_{1-\theta} = \frac{1}{2}\log\left(\frac{F_{1-\theta}}{1 - F_{1-\theta}}\right) = \frac{1}{2}\log\left(\frac{1 + u_{1-\theta}\sqrt{\sum_{i=1}^{|R(e)|} D_t^2(i)}}{1 - u_{1-\theta}\sqrt{\sum_{i=1}^{|R(e)|} D_t^2(i)}}\right) .$$

Then, we select the instance $e$ if and only if $Z_e^2 < c_\theta$ or, equivalently, $\alpha_e > \alpha_{1-\theta}$. Note that, while we select the
instance $e$ for which $Z_e$ is minimum, we use in the decision rule the law of $Z_e$ and not the one of
$\min_e Z_e$. Depending on the level of dependence of the $Z_e$'s, the risk in fact lies between $\theta$ and
$\theta \cdot |LS|$. A simulation procedure would allow us to have more information about this problem.
Note, then, that $\theta$ is more a control parameter than the probability of a type I error. The new version
of our algorithm, called PSBOOST2, is described in Figure 3.
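Under $H_0(0.5)$ the three critical values depend only on $u_{1-\theta}$ and on the sum of squared weights in $R(e)$, which the following sketch makes explicit. It assumes the normal approximation and the simplified expression $F_{1-\theta} = \frac{1}{2}(1 + u_{1-\theta}\sqrt{\sum D_t^2(i)})$; the function names and the numerical example are hypothetical.

```python
import math
from statistics import NormalDist

def critical_values_half(weights, theta):
    """Critical values (F, c_theta, alpha) under H0(0.5).

    weights: the D_t(i) of the instances in R(e).
    """
    u = NormalDist().inv_cdf(1.0 - theta)
    s = u * math.sqrt(sum(w * w for w in weights))
    if s >= 1.0:
        raise ValueError("normal approximation gives a critical F outside (0, 1)")
    f_crit = 0.5 * (1.0 + s)
    c_theta = 1.0 - s * s
    alpha_crit = 0.5 * math.log((1.0 + s) / (1.0 - s))
    return f_crit, c_theta, alpha_crit

def is_selected(w_plus, w_zero, weights, theta):
    """True when alpha_e exceeds its critical value, i.e. when F > F_crit."""
    f = w_plus + 0.5 * w_zero
    alpha_e = 0.5 * math.log(f / (1.0 - f))
    _, _, alpha_crit = critical_values_half(weights, theta)
    return alpha_e > alpha_crit

# Thirty reciprocal neighbors of weight 0.01 each, so W0 = 0.7.
weights = [0.01] * 30
f_crit, c_theta, alpha_crit = critical_values_half(weights, theta=0.05)
clear_majority = is_selected(w_plus=0.25, w_zero=0.7, weights=weights, theta=0.05)
balanced = is_selected(w_plus=0.15, w_zero=0.7, weights=weights, theta=0.05)
```

A candidate whose reciprocal neighbors mostly share its class passes the test, while a candidate with a perfectly balanced neighborhood ($F = 0.5$) does not.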


PSBOOST2(LS)
    Initialize D_1(e) = 1/|LS| for any e ∈ LS;
    Initialize candidates set LS_c = LS;
    Initialize LS' = ∅;
    Repeat
        Temp = {e' ∈ LS_c : W_{e'}^+ > W_{1−θ}^+};
        e = argmin_{e' ∈ Temp} Z_{e'};
        If α_e > α_{1−θ} Then
            Stop ← False;
            LS' = LS' ∪ {e};
            LS_c = LS_c − {e};
            Update:
            ∀e' ∈ R(e): D_{t+1}(e') = D_t(e') exp(−α_e y(e) y(e')) / Z_e;
            ∀e' ∈ LS \ R(e): D_{t+1}(e') = D_t(e') / Z_e;
        Else Stop ← True
        endIf
    Until Stop = True
    Return LS'

Figure 3: Pseudocode for PSBOOST2.


4. The Relative Neighborhood Graph


While PSBOOST2 was originally proposed for improving the k-NN algorithm, our theoretical framework
is independent of the geometrical structure used for the construction of the reciprocal neighborhood
$R(e)$. So, let us consider another neighborhood graph, called the Relative Neighborhood
Graph (RNG). Introduced by Toussaint (1980), the RNG is a connected graph in which, if two
instances $a$ and $b$ are linked by an edge, then they satisfy the following property:

$$d(a,b) \leq \max(d(a,c), d(b,c)) \quad \forall c \in LS,\ c \neq a, b .$$

This definition means that $L_{a,b}$, which corresponds to the intersection of two hyperspheres,
with centers $a$ and $b$ and with radius equal to the distance between $a$ and $b$, does not contain any other
point of the learning set LS (Figure 4 describes an example). The RNG can naturally be used in a
neighborhood-based classifier. We present here a general framework for problems with an arbitrary
number of classes and an arbitrary geometrical structure used for building the neighborhood graph.


Definition 3 Let $C_i$ be the set of learning instances belonging to the $i$-th class: $\forall i = 1, .., c$, $C_i =
\{e \in LS : y(e) = i\}$, where $c$ is the number of classes.


Definition 4 Let $O(e')$ be the $c$-dimensional vector whose components are noted $O_i(e')$, $i = 1, .., c$,
each being the proportion of instances in the neighborhood of $e'$ belonging to the $i$-th class:

$$O_i(e') = \frac{|N(e') \cap C_i|}{|N(e')|} , \quad \forall i = 1, 2, .., c ,$$

where $N(e')$ is the set of neighbors of $e'$ (linked by an edge to $e'$) in the neighborhood graph.
Note that Definition 4 also applies to new instances, not belonging to the learning set.

Definition 5 Let $\varphi(e')$ be the class given to $e'$ by the classifier $\varphi$ from the neighborhood graph (RNG
or k-NN):

$$\varphi(e') = \arg\max_i O_i(e') .$$


According to these definitions, the new instance e’ in Figure 4 would be labeled “black” from its 
neighbors 1, 2 and 3. 
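The RNG-based decision rule can be sketched as follows: a learning instance is a neighbor of the new instance when no other learning instance falls strictly inside the intersection of the two hyperspheres, and the class is chosen by majority vote over those neighbors. The Euclidean distance, the alphabetical tie-breaking and the four-point example are assumptions made for illustration.

```python
def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def rng_neighbors(query, points):
    """Indices of the learning points linked to query by an RNG edge."""
    neighbors = []
    for i, p in enumerate(points):
        d_qp = dist(query, p)
        # The edge is blocked when some other point c is closer to both ends.
        blocked = any(max(dist(query, c), dist(p, c)) < d_qp
                      for j, c in enumerate(points) if j != i)
        if not blocked:
            neighbors.append(i)
    return neighbors

def rng_classify(query, points, labels):
    """Majority vote over the RNG neighbors (argmax of the class proportions)."""
    votes = {}
    for i in rng_neighbors(query, points):
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    # All proportions share the denominator |N(e')|, so counts give the same argmax.
    return max(sorted(votes), key=lambda c: votes[c])

points = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0), (5.0, 5.0)]
labels = ["black", "black", "white", "white"]
query = (1.0, 1.0)
neighbors = rng_neighbors(query, points)
predicted = rng_classify(query, points, labels)
```

In this example the far point (5, 5) is not a neighbor because (1, 3) lies inside the corresponding intersection, and the new instance is labeled "black" by two votes against one.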





Figure 4: Relative Neighborhood Graph: the intersection of the two hyperspheres does not contain any 
instance of the learning set. 


5. Experimental Results 


In this section, we assess the efficiency of PSBOOST2 according to the two following performance 
measures: generalization accuracy and storage reduction. We used 18 datasets, most of which 
come from the UCI database repository (Merz and Murphy, 1996). The experimental method was 
the following: an $f$-fold cross-validation (here $f = 5$) was performed on each database to obtain
estimates of the true performance of the classifier. We used two neighborhood-based classifiers
according to the geometrical structures listed above (k-NN, here $k = 3$, and the RNG). The decision
rule used for classifying an instance consists of a majority vote of the neighbors. Each database
DB is divided into $f$ disjoint sets $DB_i$. PSBOOST2 is applied on each combination $DB - DB_i$. The
classifier uses the resulting subset of instances, $(DB - DB_i)_{subset}$, for classifying the instances in $DB_i$.
For each classifier, we obtain an accuracy estimate by averaging results over the $f$ sets.
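The evaluation protocol can be summarized by the sketch below. It is only an outline: the round-robin fold assignment, the plain k-NN voter and the `select` callable (a stand-in for a prototype selector such as PSBOOST2, defaulting to the identity) are assumptions of this example, not the authors' implementation.

```python
def k_fold_indices(n, f):
    """Split range(n) into f disjoint folds DB_i (round-robin assignment)."""
    return [list(range(i, n, f)) for i in range(f)]

def knn_predict(train, query, k):
    """Majority vote among the k nearest (point, label) pairs of train."""
    nearest = sorted(train,
                     key=lambda item: sum((a - b) ** 2
                                          for a, b in zip(item[0], query)))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(sorted(votes), key=lambda c: votes[c])

def cross_validate(data, f, k, select=lambda subset: subset):
    """Average accuracy over f folds.

    data: list of (point, label) pairs.
    select: hypothetical prototype selector applied to DB - DB_i.
    """
    accuracies = []
    for fold in k_fold_indices(len(data), f):
        held_out = set(fold)
        train = select([data[i] for i in range(len(data)) if i not in held_out])
        hits = sum(knn_predict(train, data[i][0], k) == data[i][1] for i in fold)
        accuracies.append(hits / len(fold))
    return sum(accuracies) / f

# Two well-separated classes; every held-out instance should be recovered.
data = ([((float(i),), "a") for i in range(10)]
        + [((100.0 + i,), "b") for i in range(10)])
accuracy = cross_validate(data, f=5, k=3)
```

Passing a selector as `select` keeps the selection step inside each fold, so the held-out set $DB_i$ never influences which prototypes are retained.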

Note that we did not conduct a large comparative study between PSBOOST2 and the state-of- 
the-art prototype selection algorithms because it was already carried out for PSBOOST by Nock and 
Sebban (2001a), of which the main results are described in Table 1. These results have shown the 
difficulties that the standard prototype selection algorithms have in controlling the two performance 
measures. From the results described in Table 2, we can make the following remarks: 


1. The learning set size is highly reduced (nearly 45% of the original size on average), while
controlling the generalization accuracy. While the accuracy is slightly reduced for the Relative
Neighborhood Graph by an amount that is not significant using a Student paired t-test
over accuracies, the predictive accuracy of the post-PSBOOST2 nearest neighbor classifier is
increased (74.7% vs. 75.2%), even though this superiority is not significant with a p-value
near 0.5. Therefore, it seems to confirm experimentally that PSBOOST2 is suited to control
the generalization accuracy while significantly reducing the data.


2. A simple strategy for assessing the relevance of PSBOOST2 consists in comparing the selected
subset LS' with another subset of the same size randomly selected from LS.
Such a procedure allows one to estimate the quality of the selected prototypes. We made this
comparison (columns PSBOOST2 Acc. and RAN in Table 2). Our strategy achieves a significantly
higher accuracy than a random one, and this also tends to confirm the efficiency of
PSBOOST2.


KNN PSBOOST2 RAN | RNG PSBOOST2 RAN 
ECHO 59.2 63.4 : 56.3 62.7 
HEPAT. 83.1 79.9 : 73.0 75.5 
HEART 78.1 82.1 é 74.1 74.8 
AUDIO T2712 70.9 60.3 
BIGPOLE | 59.5 60.2 54.6 58.2 
HORSE 72.3 73.4 64.3 67.5 
IONO 80.4 80.4 A295 ABS: 
XD6 79.8 79.1 79.5 71.0 
BREAST 96.7 96.9 95.5 94.5 
91.4 92.0 : 91.1 89.5 
71.9 72.0 . 67.7 66.5 
50.0 48.3 54.8 64.8 
73.5 76.5 ; 74.0 68.1 
83.9 88.1 : 88.7 85.1 
69.8 69.3 i 69.6 69.1 
AUSTRAL | 79.7 76.8 X 76.8 73.9 
GERMAN | 69.9 71.3 70.0 70.6 
VEHICLE | 70.9 70.3 71.9 71.7 
74.7 75.2 47 72.9 | 72.5 72.1 

Table 2: Effect of PSBOOST2 on learning set size and generalization accuracy on 18 datasets; k-NN and RNG
correspond respectively to the accuracy on $DB_i$, using the whole learning set, with a 3-NN classifier
and a voting rule based on the RNG; PSBOOST2 is described by its accuracy (Acc.) and its storage
requirement (% pr); RAN corresponds to the accuracy achieved from a learning subset of the same size
($|LS'|$) randomly selected in LS.


6. Some Insights into the Performances of PSBOOST2


In this section, we explain why PSBOOST2 is suited for reducing storage while controlling the
classifier accuracy.


6.1 PSBOOST2 and Margin Maximization


A partial explanation of PSBOOST2’s performances may rely on the margin maximization principle. 
This principle is in fact not recent, and was originally suggested in Vapnik (1982) for support vector 
machines (SVMs) with optimal margins. Even though the objective in both approaches consists in 
finding classifiers which maximize margins on learning data, a detailed study of their mechanisms 
shows that they slightly differ (Schapire et al., 1998). In SVMs the sum of squared outputs of the 
base hypotheses and the sum of the squared weights are both assumed to be bounded ($l_2$ norm),
while in boosting the maximum value of the base hypotheses ($l_\infty$ norm) and the sum of the absolute
values of the weights ($l_1$ norm) are assumed to be bounded. Support vector machines give rise to
a quadratic programming problem, whereas the optimization in boosting can be seen as a linear 
programming problem. 

In Schapire et al. (1998), the authors prove that achieving a large margin on LS results in an
improved bound on the generalization error. They also prove that ADABOOST is suited to maximizing
the number of learning examples with large margin. They define the classification margin as the
difference between the weight assigned to the correct label and the maximal weight assigned to any
single incorrect label. The margin is then a number in the range [-1,+1], and an example is correctly
classified if it has a positive margin. The margin also corresponds to a degree of confidence in the
classification. In order to assess the effect of PSBOOST2 on margins, we computed for the k-NN
classifier the margin gain $g_i$ for each dataset i over the 5 folds (before and after PSBOOST2).
We first observe that over the 18 datasets, the average margin gain is
$G = \frac{1}{18}\sum_i g_i = 0.24$. This might be an experimental explanation for the way PSBOOST2
controls accuracy. Moreover, a second observation shows the ability of PSBOOST2 to increase
margins, as all datasets have a margin gain $g_i > 0$.
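
The margin definition above can be illustrated with a minimal sketch for a k-NN vote. It assumes that the "weight" assigned to a label is simply the fraction of the k neighbors voting for it; the function name `knn_margin` and this normalization are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def knn_margin(neighbor_labels, true_label):
    """Margin of one k-NN vote, in [-1, +1].

    Assumption: the weight of a label is the fraction of the k neighbors
    carrying that label. The margin is the weight of the true label minus
    the largest weight obtained by any single incorrect label.
    """
    k = len(neighbor_labels)
    votes = Counter(neighbor_labels)
    correct = votes.get(true_label, 0) / k
    # Largest share among the incorrect labels (0 if there is none).
    wrong = max(
        (n for label, n in votes.items() if label != true_label), default=0
    ) / k
    return correct - wrong
```

With this convention, a positive value means the example is correctly classified by the vote, and averaging the difference of margins before and after data reduction gives a per-dataset margin gain.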


6.2 The Filter Precision of PSBOOST2 


Brodley and Friedl (1996) provided a method for evaluating the ability of a data reduction technique
to identify and eliminate mislabeled instances (called filter precision). In a way, this procedure
assesses the sensitivity to noise. Consider a learning set artificially corrupted by a given percentage
of noise. One defines the following three sets: the set D of discarded instances, the set M of instances
corrupted a priori, and the set M ∩ D of corrupted instances discarded by the data reduction technique.
Brodley and Friedl defined P(E) as an estimate of the probability of retaining bad data:

$$P(E) = \frac{|M| - |M \cap D|}{|M|}.$$

While the original 18 datasets probably already contain noisy data, we decided to calculate P(E) 
for different artificial noise levels. We corrupted the original data successively with 5, 10, ..., 35% 
noise. Table 3 reports P(E) averaged over all datasets and all folds for the k-NN and the RNG 
classifiers. 
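
As a minimal sketch of the estimate P(E), assuming it is the share of corrupted instances that survive the reduction, the computation from the sets M and D could look as follows. The function name and the use of instance identifiers are hypothetical.

```python
def retained_bad_data_rate(corrupted, discarded):
    """Estimate P(E): fraction of corrupted instances that were NOT discarded.

    corrupted: identifiers of the instances in M (artificially mislabeled).
    discarded: identifiers of the instances in D (removed by the reduction).
    """
    corrupted = set(corrupted)
    discarded = set(discarded)
    if not corrupted:
        raise ValueError("M must contain at least one corrupted instance")
    retained_bad = corrupted - discarded  # M minus (M ∩ D)
    return len(retained_bad) / len(corrupted)
```

A value close to 0 means that almost all corrupted instances were filtered out.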

In the presence of noise, the subset of instances selected by PSBOOST2 (described by its accuracy
Acc_aft) is always better than the original learning set (Acc_bef). The accuracy is actually
improved after prototype selection, and this trend seems to become stronger as the noise level
increases. This phenomenon is not really surprising. Indeed, noise smooths class distributions near
their frontiers. These “dangerous regions” tend precisely to be discarded by PSBOOST2.




NOISE        P(E) WITH kNN                 P(E) WITH RNG
         Acc_bef  Acc_aft  P(E)       Acc_bef  Acc_aft  P(E)
5%        71.7     72.5    0.07        68.6     68.7    0.15
10%       67.9     69.3    0.08        65.9     66.7    0.15
15%       64.1     67.6    0.07        62.5     63.9    0.17
20%       63.5     66.0    0.08        59.1     60.6    0.16
25%       61.2     64.1    0.08        58.5     59.5    0.17
30%       58.7     61.1    0.08        56.4     58.3    0.19
35%       56.3     60.1    0.09        54.1     56.1    0.18


Table 3: PSBOOST2’s filter precision 


7. Extension to Multiclass Problems 


In this section, we present the extension of the test to multiclass problems. 


7.1 Test on Z_e


So far, we have only treated binary problems. Many real-world learning problems are in fact
multiclass, with many more possible labels. Two main strategies have been proposed to deal with this
extension to multiclass problems. The first one consists in creating one binary problem for each
of the c classes. Then, we test one class j against all the other classes, answering the following
question: “Does the example belong to the j-th class or not?” This approach is called one-against-all
(Allwein, Schapire and Singer, 2000). The second one consists in testing all pairs of classes (Hastie
and Tibshirani, 1998). For each distinct pair of classes $c_1, c_2$, the examples labeled $c_1$ are
considered positive, and those labeled $c_2$ are considered negative. All other examples are ignored.
This approach is called all-pairs. An interesting comparison is presented in Allwein, Schapire and
Singer (2000). In our approach, we chose the first method (one-against-all), which requires the
construction of c binary problems.
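
The two decompositions described above can be sketched as follows. This is only an illustration of how the binary problems are built from a list of class labels; the function names and the +1/-1 encoding of the binary labels are assumptions made for the example.

```python
from itertools import combinations


def one_against_all(labels, classes):
    """Build c binary problems: for class j, +1 if the label is j, else -1."""
    return {j: [1 if y == j else -1 for y in labels] for j in classes}


def all_pairs(labels, classes):
    """Build one binary problem per unordered pair of classes (c1, c2).

    Examples labeled c1 are positive, those labeled c2 are negative,
    and every other example is ignored. Each problem is a list of
    (example_index, binary_label) pairs.
    """
    problems = {}
    for c1, c2 in combinations(classes, 2):
        problems[(c1, c2)] = [
            (i, 1 if y == c1 else -1)
            for i, y in enumerate(labels)
            if y in (c1, c2)
        ]
    return problems
```

One-against-all yields c problems that each use the whole learning set, whereas all-pairs yields c(c-1)/2 smaller problems.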

In the test proposed for solving binary problems (see Section 3.3), a candidate e is selected when
the corresponding $Z_e = 2\sqrt{F_e(1-F_e)}$ is minimum (where $F_e = W_e^+ + \frac{1}{2}W_e^0$),
while $F_e > F_{1-\alpha}$. We recall that $F_{1-\alpha}$ is the critical value of $F_e$ at the risk
$\alpha$ under $H_0(\pi_0)$, the hypothesis that the true class is randomly attributed with a given
probability $\pi_0$ in the reciprocal neighborhood R(e).

In this section, for multiclass problems, we denote by $F_{je}$ the value of $F_e$ when the class j
is tested against the others. We propose to select the candidate e for which the quantity
$Z_e = 2\sqrt{F_e(1-F_e)}$ is minimum, when $F_e$ is defined as follows:

$$F_e = \frac{1}{c}\sum_{j=1}^{c} F_{je}.$$

The condition required to select e is the following: $F_e > F_{1-\alpha}$, where $F_{1-\alpha}$ is
the critical value of $F_e$ at the risk $\alpha$ under the null hypothesis. When the class j is
tested against the others, the null hypothesis, denoted by $H_0(\pi_{j0})$, means that the class j
is randomly distributed with a given probability $\pi_{j0}$ in $R_j(e)$, the reciprocal neighborhood
of e when the class j is tested against the others. Note that $R_j(e)$ changes with the value of j.
In order to find the critical value $F_{1-\alpha}$, we have to define $\mu$ and $\sigma^2$ such that:


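$$
\begin{aligned}
\mu &= E(F_e) = \frac{1}{c}\sum_{j=1}^{c} E(F_{je})
     = \frac{1}{c}\sum_{j=1}^{c}\Big(\sum_{e' \in R_j(e)} p_j D_{jt}(e') + \frac{1}{2} W_{je}^{0}\Big), \\
\sigma^2 &= Var(F_e) = \frac{1}{c^2}\sum_{j=1}^{c} Var(F_{je})
     = \frac{1}{c^2}\sum_{j=1}^{c}\sum_{e' \in R_j(e)} p_j (1 - p_j) D_{jt}^2(e'),
\end{aligned}
$$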


where $D_{jt}(e')$ is the distribution at stage t of the boosting, when the class j is tested against
all the others. We note $p_j = \pi_{j0}$ if y(e) = j, else $p_j = 1 - \pi_{j0}$. According to the
simplification proposed in Section 3.2,


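$$
\begin{aligned}
\mu &= \frac{1}{c}\sum_{j=1}^{c}\Big(\sum_{i=1}^{|R_j(e)|} p_j D_{jt}(i) + \frac{1}{2} W_{je}^{0}\Big), \\
\sigma^2 &= \frac{1}{c^2}\sum_{j=1}^{c}\sum_{i=1}^{|R_j(e)|} p_j (1 - p_j) D_{jt}^2(i).
\end{aligned}
$$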


We assume in the calculation of $\sigma^2$ the independence of the $F_{je}$. Said differently, we
consider that the knowledge of $R_j(e)$, from which $F_{je}$ is computed, does not contain
information about the nature of the reciprocal neighborhood $R_l(e)$, when $j \neq l$. Actually,
even if the quantity $|R_j(e)|$ remains the same $\forall j$, the labels and the weights of the
neighbors in $R_j(e)$ will differ according to the tested class j. From this point of view,
covariances can be considered as insignificant.

Moreover, note that the variables $F_{je}$ are computed from independent variables, so they are not
too far from a normal distribution. Furthermore, as mentioned before, they are approximately
independent. We can therefore claim that $F_e$ is very close to a normal distribution, and determine
the critical values $F_{1-\alpha}$ and $Z_\alpha$ for $F_e$ and $Z_e$:

$$
\begin{aligned}
F_{1-\alpha} &= \mu + u_{1-\alpha}\,\sigma, \\
Z_\alpha^2 &= 4 F_{1-\alpha}(1 - F_{1-\alpha}).
\end{aligned}
$$
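
The computation of the critical values can be sketched numerically. The sketch below assumes that $u_{1-\alpha}$ denotes the quantile of order $1-\alpha$ of the standard normal distribution, and that each class j is summarized by a hypothetical triple (p_j, weights of the instances in R_j(e), W_je^0); these input conventions and function names are illustrative, not the authors' implementation.

```python
from math import sqrt
from statistics import NormalDist


def moments(class_terms):
    """Mean and variance of F_e under the null hypothesis.

    class_terms: one (p_j, weights_j, w0_j) triple per class j, where
    weights_j lists D_jt(i) for the instances of R_j(e) and w0_j is W_je^0.
    """
    c = len(class_terms)
    mu = sum(p * sum(w) + 0.5 * w0 for p, w, w0 in class_terms) / c
    var = sum(p * (1 - p) * sum(x * x for x in w) for p, w, _ in class_terms) / c**2
    return mu, var


def critical_values(mu, var, alpha=0.05):
    """Return (F_{1-alpha}, Z_alpha) using the normal approximation."""
    u = NormalDist().inv_cdf(1 - alpha)  # assumed meaning of u_{1-alpha}
    f_crit = mu + u * sqrt(var)
    # Clamp at 0 so that the square root stays defined if f_crit leaves [0, 1].
    z_crit = 2 * sqrt(max(0.0, f_crit * (1 - f_crit)))
    return f_crit, z_crit
```

A candidate e would then be accepted when its observed $F_e$ exceeds the returned `f_crit`.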


Note that for the special case where $p_j = 0.5$ (for satisfying an absolute decision rule), the
previous formulae are greatly simplified. Indeed,

$$
\sum_{i=1}^{|R_j(e)|} p_j D_{jt}(i) + \frac{1}{2} W_{je}^{0}
  = \frac{1}{2}\big(W_{je}^{+} + W_{je}^{-}\big) + \frac{1}{2} W_{je}^{0}
  = \frac{1}{2}.
$$


PSBOOST2_MC (LS)
  Initialize D_{j1}(e) = 1/|LS| for any e ∈ LS;
  Initialize candidates set LS_c = LS;
  Initialize LS' = ∅
  Repeat
    Temp = {e' ∈ LS_c : W_{e'}^+ > W_{e'}^-}
    e = argmin_{e' ∈ Temp} Z_{e'}
    If F_e > F_{1-α} Then
      Stop ← False
      LS' = LS' ∪ {e}
      LS_c = LS_c - {e}
      Update:
      For j = 1, 2, ..., c
        ∀e' ∈ R_j(e):       D_{j,t+1}(e') = D_{jt}(e') exp(-α_e M(y(e'), j) M(y(e), j)) / Z_{je}
        ∀e' ∈ LS \ R_j(e):  D_{j,t+1}(e') = D_{jt}(e') / Z_{je}
      EndFor
    Else Stop ← True
    EndIf
  Until Stop = True
  Return LS'

Figure 5: Pseudocode for PSBOOST2_MC.

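Then,

$$\mu = \frac{1}{c}\sum_{j=1}^{c}\frac{1}{2} = \frac{1}{2}.$$

And,

$$
\sigma^2 = \frac{1}{c^2}\sum_{j=1}^{c}\sum_{i=1}^{|R_j(e)|}\frac{1}{4} D_{jt}^2(i)
         = \frac{1}{4c^2}\sum_{j=1}^{c}\sum_{i=1}^{|R_j(e)|} D_{jt}^2(i).
$$
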


The pseudocode of our extended algorithm, called PSBOOST2_MC, is described in Figure 5. Note
that we use in this algorithm the coding matrix M(y(e), j) which was originally given by Dietterich
and Bakiri (1995). For the one-against-all approach, M is a c × c matrix in which all diagonal
elements are positive (+1) and all other elements are negative (-1). When a class j is tested against
the others, the current label of the instance e is the value M(y(e), j), where y(e) ∈ {1, 2, ..., c}.
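
To make the role of the coding matrix concrete, here is a minimal sketch of the one-against-all matrix and of one reweighting step for the binary problem "class j against the others". It assumes 0-based class indices (the text uses 1, ..., c), a dictionary-based representation of the distribution, and a normalizer chosen so that the updated weights sum to one; all names are hypothetical.

```python
from math import exp


def coding_matrix(c):
    """One-against-all coding matrix: +1 on the diagonal, -1 elsewhere."""
    return [[1 if row == col else -1 for col in range(c)] for row in range(c)]


def update_distribution(D_j, labels, e, neighbors_j, alpha_e, M, j):
    """One update of the distribution attached to class j.

    D_j: dict instance -> current weight for the problem 'j vs. the rest'.
    labels: dict instance -> class index (0-based, an assumption).
    neighbors_j: instances in the reciprocal neighborhood R_j(e).
    Instances outside R_j(e) are only renormalized.
    """
    updated = {}
    for x, weight in D_j.items():
        if x in neighbors_j:
            agreement = M[labels[x]][j] * M[labels[e]][j]  # +1 or -1
            weight *= exp(-alpha_e * agreement)
        updated[x] = weight
    z = sum(updated.values())  # assumed normalizer
    return {x: w / z for x, w in updated.items()}
```

With a positive `alpha_e`, neighbors whose binary label agrees with that of e lose weight, while disagreeing neighbors gain weight, as in the usual boosting update.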


7.2 Experimental Results 


Table 4 presents the properties (name, number of classes, learning set size and number of features) 
of the eight tested datasets. In order to assess the relevance of our multiclass statistical test, we 


WAVES 21
ABALONE
GLASS
BALANCE
IRIS
LED
LED+17
DERMATOLOGY

Table 4: Multiclass classification problems.


[Plot: Accuracy (vertical axis).]

Figure 6: Contribution of PSBOOST2_MC on multiclass problems: the solid line corresponds to the
accuracy of a standard k-NN classifier, built from the whole learning sample; the dashed line
represents the success rate computed from the reduced learning set.


used many values of k (k = 1, 2, ..., 10) in the k-nearest neighbor classifier. Except for this
detail, the experimental method remains the same as in the previous study, namely 5-fold
cross-validation. A graphical synthesis of the results is presented in Figure 6. Each point of this
figure is the average over the eight datasets, each of them tested five times during the
cross-validation. Therefore, one point corresponds to the average of forty accuracies. Beyond data
reduction, the results display the positive contribution of PSBOOST2_MC to accuracy: for all values
of k, the accuracies achieved from the reduced learning set are indeed higher than without data
reduction.


8. Weighted Classifiers using Instance Confidences


Beyond prototype selection, this section aims at exploring an issue that was raised by Sebban, Nock
and Lallich (2001): the use of boosting-derived weights for weighted nearest neighbor rules. In
such a context, the classification rule (as defined in Section 4) must be slightly modified, since it
no longer handles classes, but real weights in favor of each class. The following definition for
O(e') replaces Definition 4:


Definition 6 Let O(e') be the c-dimensional vector whose components are denoted $O_i(e')$,
i = 1, 2, ..., c, each being the sum of the weights of the instances in the neighborhood of e'
belonging to the i-th class:

$$O_i(e') = \sum_{e \in N(e') \,:\, y(e) = i} \alpha_e .$$


Note that $\alpha_e$ is still the confidence of the instance e when e is selected, but we end up
selecting all instances. The weighting algorithm is a slight variant of PSBOOST2_MC, in which the
condition $W_e^+ > W_e^-$ is removed. This small algorithmic difference is crucial, as some instances
may now have a negative weight. This still makes sense, because the new rule leverages the
neighborhood vote in favor of some classes, or in disfavor of others when negative weights abound.
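
The weighted vote can be sketched as follows. The accumulation of the vector O(e') follows the definition above; the final decision (predicting the class with the largest component, ties going to the smallest index) is an assumption made for this example, since the rule of Section 4 is not restated here. Names and 0-based class indices are hypothetical.

```python
def weighted_vote(neighbors, n_classes):
    """Compute O(e') and a predicted class from weighted neighbors.

    neighbors: list of (class_index, confidence) pairs for the instances
    in N(e'); confidences may be negative.
    """
    O = [0.0] * n_classes
    for cls, confidence in neighbors:
        O[cls] += confidence
    # Assumed decision rule: class with the largest accumulated weight.
    predicted = max(range(n_classes), key=O.__getitem__)
    return O, predicted
```

For instance, a class supported by two neighbors with negative confidences can lose the vote against a class supported by a single neighbor with a small positive confidence.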

Experimental studies have been conducted with a k-nearest neighbor classifier, for k = 1, 2, ..., 20.
We applied our approach on twenty-three datasets. Rather than presenting the twenty-three curves
(one for each dataset), we synthesize the results in one figure, where each point is the average of
5 (folds) × 23 (datasets) = 115 accuracies. Results are presented in Figure 7. It appears that the
performance of the standard k-NN rule is almost systematically improved by leveraging votes with
the boosting weights. Moreover, a paired Student t-test reveals that the difference between the
standard k-NN and our weighted k-NN is significant for all values k = 1, 2, ..., 11. For k large
enough (k ≥ 12), the difference becomes insignificant. This can be explained by the fact that large
values of k tend to smooth neighborhood distributions (ultimately, they become the whole sample's),
for which weighting brings no significant difference.

Another concise way to display the results consists in presenting them separately for each
dataset, as an average over the different values of k. Instead of identifying the good values of k,
we identify the good datasets, candidates for an improvement with our weighted nearest neighbor
rule. We choose to take into account only the values of k < 12, for which weighting brings on
average a statistical advantage. The results are presented in Table 5 and graphically represented in
Figure 8. We can note that for 17 datasets, a weighted decision rule provides better results than the
unweighted rule. Among them, 7 datasets (Balance, Echocardiogram, German, Horse Colic, Led,
Pima and Vehicle) see important improvements, ranging from 1% to more than 5%. In contrast, only
one dataset sees a significant accuracy decrease (Car, 96.0% vs. 93.9%).


9. Conclusions and Future Research


This paper explores a method for prototype selection based on boosting, and gives statistical criteria 
for stopping the selection of instances, a crucial problem for the approach (Nock and Sebban, 2001a) 
as well as for usual boosting algorithms. The whole approach is cast into multiclass classification 
problems, thereby relaxing the class cardinality constraint of Sebban, Nock and Lallich (2001). 
So far, the framework proposed in this paper holds only for neighborhood-based classifiers. An 


[Plot: Accuracy (vertical axis) against k (horizontal axis).]

Figure 7: Comparison between a standard k-NN classifier (solid line) and a weighted classifier using the
relevance of each instance (dashed line).


interesting direction of research consists in finding such a method tailored to processing data for 
induction algorithms, such as, for example, decision tree induction. 


Furthermore, we have shown that instead of reducing the learning set size, the boosting-derived
weights can be used in weighted nearest neighbor rules, with an experimentally observed statistical
advantage over the usual, unweighted rules. Because it boils down to boosting with instances
as weak learners that abstain, and because nearest neighbor rules are among the earliest, simplest
and still widely used classifiers, this algorithm certainly deserves theoretical investigation, in
particular to relate it to boosting theory and its results (Freund and Schapire, 1997; Schapire and
Singer, 1998).